Computation and Language
☆ H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages
Byte-level language models eliminate fragile tokenizers but face
computational challenges in morphologically-rich languages (MRLs), where words
span many bytes. We propose H-NET++, a hierarchical dynamic-chunking model that
learns linguistically-informed segmentation through end-to-end training. Key
innovations include: (1) a lightweight Transformer context-mixer (1.9M
parameters) for cross-chunk attention, (2) a two-level latent hyper-prior for
document-level consistency, (3) specialized handling of orthographic artifacts
(e.g. Persian ZWNJ), and (4) curriculum-based training with staged sequence
lengths. On a 1.4B-token Persian corpus, H-NET++ achieves state-of-the-art
results: 0.159 BPB reduction versus BPE-based GPT-2-fa (12% better
compression), 5.4pp gain on ParsGLUE, 53% improved robustness to ZWNJ
corruption, and 73.8% F1 on gold morphological boundaries. Our learned chunks
align with Persian morphology without explicit supervision, demonstrating that
hierarchical dynamic chunking provides an effective tokenizer-free solution for
MRLs while maintaining computational efficiency.
☆ How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations
Large Language Models (LLMs) have started to demonstrate the ability to
persuade humans, yet our understanding of how this dynamic transpires is
limited. Recent work has used linear probes, lightweight tools for analyzing
model representations, to study various LLM skills such as the ability to model
user sentiment and political perspective. Motivated by this, we apply probes to
study persuasion dynamics in natural, multi-turn conversations. We leverage
insights from cognitive science to train probes on distinct aspects of
persuasion: persuasion success, persuadee personality, and persuasion strategy.
Despite their simplicity, we show that they capture various aspects of
persuasion at both the sample and dataset levels. For instance, probes can
identify the point in a conversation where the persuadee was persuaded or where
persuasive success generally occurs across the entire dataset. We also show
that in addition to being faster than expensive prompting-based approaches,
probes can do just as well and even outperform prompting in some settings, such
as when uncovering persuasion strategy. This suggests probes as a plausible
avenue for studying other complex behaviours such as deception and
manipulation, especially in multi-turn settings and large-scale dataset
analysis where prompting-based methods would be computationally inefficient.
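As a rough illustration of the probing setup described above, the sketch below fits a logistic-regression probe on synthetic stand-ins for per-turn hidden states. The data, dimensionality, and plain gradient-descent loop are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, epochs=500):
    """Fit a logistic-regression probe on hidden states X (n, d)
    against binary labels y (n,) with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def probe_predict(w, b, X):
    """Binary predictions from the trained probe."""
    return (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)

# Toy stand-in for per-turn hidden states: two separable clusters,
# e.g. "not yet persuaded" (0) vs "persuaded" (1) turns.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.3, (50, 8)), rng.normal(1, 0.3, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
w, b = train_linear_probe(X, y)
acc = np.mean(probe_predict(w, b, X) == y)
```

On real data, X would be hidden activations extracted at each conversation turn, and the probe's per-turn predictions would locate where persuasion occurs.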
★ Learning to Reason for Factuality
Xilun Chen, Ilia Kulikov, Vincent-Pierre Berges, Barlas Oğuz, Rulin Shao, Gargi Ghosh, Jason Weston, Wen-tau Yih
Reasoning Large Language Models (R-LLMs) have significantly advanced complex
reasoning tasks but often struggle with factuality, generating substantially
more hallucinations than their non-reasoning counterparts on long-form
factuality benchmarks. However, extending online Reinforcement Learning (RL), a
key component in recent R-LLM advancements, to the long-form factuality setting
poses several unique challenges due to the lack of reliable verification
methods. Previous work has utilized automatic factuality evaluation frameworks
such as FActScore to curate preference data in the offline RL setting, yet we
find that directly leveraging such methods as the reward in online RL leads to
reward hacking in multiple ways, such as producing less detailed or less relevant
responses. We propose a novel reward function that simultaneously considers the
factual precision, response detail level, and answer relevance, and applies
online RL to learn high quality factual reasoning. Evaluated on six long-form
factuality benchmarks, our factual reasoning model achieves an average
reduction of 23.1 percentage points in hallucination rate, a 23% increase in
answer detail level, and no degradation in the overall response helpfulness.
☆ Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
Graphical User Interface (GUI) grounding, the task of mapping natural
language instructions to precise screen coordinates, is fundamental to
autonomous GUI agents. While existing methods achieve strong performance
through extensive supervised training or reinforcement learning with labeled
rewards, they remain constrained by the cost and availability of pixel-level
annotations. We observe that when models generate multiple predictions for the
same GUI element, the spatial overlap patterns reveal implicit confidence
signals that can guide more accurate localization. Leveraging this insight, we
propose GUI-RC (Region Consistency), a test-time scaling method that constructs
spatial voting grids from multiple sampled predictions to identify consensus
regions where models show highest agreement. Without any training, GUI-RC
improves accuracy by 2-3% across various architectures on ScreenSpot
benchmarks. We further introduce GUI-RCPO (Region Consistency Policy
Optimization), which transforms these consistency patterns into rewards for
test-time reinforcement learning. By computing how well each prediction aligns
with the collective consensus, GUI-RCPO enables models to iteratively refine
their outputs on unlabeled data during inference. Extensive experiments
demonstrate the generality of our approach: GUI-RC boosts
Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO
further improves it to 85.14% through self-supervised optimization. Our
approach reveals the untapped potential of test-time scaling and test-time
reinforcement learning for GUI grounding, offering a promising path toward more
robust and data-efficient GUI agents.
comment: Project Page: https://zju-real.github.io/gui-rcpo Code:
https://github.com/zju-real/gui-rcpo
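The region-consistency idea can be sketched with a minimal voting grid: each sampled bounding box casts votes onto a pixel grid, and the consensus point is the centroid of the cells with maximal agreement. The function name and box format are illustrative assumptions, not the released GUI-RC code.

```python
import numpy as np

def region_consistency(boxes, width, height):
    """Vote each sampled box (x1, y1, x2, y2) onto a pixel grid and
    return the centroid of the cells with maximal agreement."""
    grid = np.zeros((height, width), dtype=int)
    for x1, y1, x2, y2 in boxes:
        grid[y1:y2, x1:x2] += 1          # each box votes for its pixels
    ys, xs = np.where(grid == grid.max())  # consensus region
    return int(xs.mean()), int(ys.mean())

# Three sampled predictions that all overlap around (45..57, 45..57).
boxes = [(40, 40, 60, 60), (45, 45, 65, 65), (42, 38, 58, 58)]
x, y = region_consistency(boxes, width=100, height=100)
```

The same grid.max() agreement counts could serve as the per-prediction reward signal that GUI-RCPO turns into a test-time RL objective.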
★ OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks
Zixuan Wang, Dingming Li, Hongxing Li, Shuo Chen, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Large language models excel at abstract reasoning but their capacity for
embodied agent reasoning remains largely unexplored. We present OmniEAR, a
comprehensive framework for evaluating how language models reason about
physical interactions, tool usage, and multi-agent coordination in embodied
tasks. Unlike existing benchmarks that provide predefined tool sets or explicit
collaboration directives, OmniEAR requires agents to dynamically acquire
capabilities and autonomously determine coordination strategies based on task
demands. Through text-based environment representation, we model continuous
physical properties and complex spatial relationships across 1,500 scenarios
spanning household and industrial domains. Our systematic evaluation reveals
severe performance degradation when models must reason from constraints: while
achieving 85-96% success with explicit instructions, performance drops to
56-85% for tool reasoning and 63-85% for implicit collaboration, with compound
tasks showing over 50% failure rates. Surprisingly, complete environmental
information degrades coordination performance, indicating models cannot filter
task-relevant constraints. Fine-tuning improves single-agent tasks dramatically
(0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing
fundamental architectural limitations. These findings demonstrate that embodied
reasoning poses fundamentally different challenges than current models can
address, establishing OmniEAR as a rigorous benchmark for evaluating and
advancing embodied AI systems. Our code and data are included in the
supplementary materials and will be open-sourced upon acceptance.
comment: Project Page: https://zju-real.github.io/OmniEmbodied Code:
https://github.com/ZJU-REAL/OmniEmbodied
★ Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models
Large language models (LLMs) have demonstrated remarkable performance in
reasoning tasks, where reinforcement learning (RL) serves as a key algorithm
for enhancing their reasoning capabilities. Currently, there are two mainstream
reward paradigms: model-based rewards and rule-based rewards. However, both
approaches suffer from limitations: rule-based rewards lack robustness, while
model-based rewards are vulnerable to reward hacking. To address these issues,
we propose Cooper (Co-optimizing Policy Model and Reward Model), an RL framework
that jointly optimizes both the policy model and the reward model. Cooper
leverages the high precision of rule-based rewards when identifying correct
responses, and dynamically constructs and selects positive-negative sample
pairs for continued training of the reward model. This design enhances robustness
and mitigates the risk of reward hacking. To further support Cooper, we
introduce a hybrid annotation strategy that efficiently and accurately
generates training data for the reward model. We also propose a reference-based
reward modeling paradigm, where the reward model takes a reference answer as
input. Based on this design, we train a reward model named VerifyRM, which
achieves higher accuracy on VerifyBench compared to other models of the same
size. We conduct reinforcement learning using both VerifyRM and Cooper. Our
experiments show that Cooper not only alleviates reward hacking but also
improves end-to-end RL performance, for instance, achieving a 0.54% gain in
average accuracy on Qwen2.5-1.5B-Instruct. Our findings demonstrate that
dynamically updating the reward model is an effective way to combat reward
hacking, providing a reference for better integrating reward models into RL.
comment: Project Page: https://zju-real.github.io/cooper Code:
https://github.com/zju-real/cooper
☆ Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, Hao Li
Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large
Language Models (LLMs) by decomposing complex tasks into simpler, sequential
subtasks. However, extending CoT to vision-language reasoning tasks remains
challenging, as it often requires interpreting transitions of visual states to
support reasoning. Existing methods often struggle with this due to limited
capacity of modeling visual state transitions or incoherent visual trajectories
caused by fragmented architectures.
To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought
framework that enables coherent and grounded multimodal reasoning within a
single unified model. The key idea is to leverage a model capable of both image
understanding and generation to reason over visual content and model evolving
visual states. However, empowering a unified model to achieve that is
non-trivial, given the high computational cost and the burden of training. To
address this, Uni-CoT introduces a novel two-level reasoning paradigm: a
Macro-Level CoT for high-level task planning and a Micro-Level CoT for subtask
execution. This design significantly reduces the computational overhead.
Furthermore, we introduce a structured training paradigm that combines
interleaved image-text supervision for macro-level CoT with multi-task
objectives for micro-level CoT. Together, these innovations allow Uni-CoT to
perform scalable and coherent multi-modal reasoning. Furthermore, thanks to our
design, all experiments can be efficiently completed using only 8 A100 GPUs
with 80GB VRAM each. Experimental results on a reasoning-driven image
generation benchmark (WISE) and editing benchmarks (RISE and KRIS) indicate
that Uni-CoT achieves SOTA performance and strong generalization, establishing Uni-CoT
as a promising solution for multi-modal reasoning. Project Page and Code:
https://sais-fuxi.github.io/projects/uni-cot/
comment: https://sais-fuxi.github.io/projects/uni-cot/
☆ MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy
Large language models have achieved substantial progress in mathematical
reasoning, yet their advancement is limited by the scarcity of high-quality,
high-difficulty training data. Existing synthesis methods largely rely on
transforming human-written templates, limiting both diversity and scalability.
We propose MathSmith, a novel framework for synthesizing challenging
mathematical problems to enhance LLM reasoning. Rather than modifying existing
problems, MathSmith constructs new ones from scratch by randomly sampling
concept-explanation pairs from PlanetMath, ensuring data independence and
avoiding contamination. To increase difficulty, we design nine predefined
strategies that act as soft constraints during rationale generation. We further
adopt reinforcement learning to jointly optimize structural validity, reasoning
complexity, and answer consistency. The length of the reasoning trace generated
under autoregressive prompting is used to reflect cognitive complexity,
encouraging the creation of more demanding problems aligned with
long-chain-of-thought reasoning. Experiments across five benchmarks,
categorized as easy & medium (GSM8K, MATH-500) and hard (AIME2024, AIME2025,
OlympiadBench), show that MathSmith consistently outperforms existing baselines
under both short and long CoT settings. Additionally, a weakness-focused
variant generation module enables targeted improvement on specific concepts.
Overall, MathSmith exhibits strong scalability, generalization, and
transferability, highlighting the promise of high-difficulty synthetic data in
advancing LLM reasoning capabilities.
☆ Iterative Learning of Computable Phenotypes for Treatment Resistant Hypertension using Large Language Models
Large language models (LLMs) have demonstrated remarkable capabilities for
medical question answering and programming, but their potential for generating
interpretable computable phenotypes (CPs) is under-explored. In this work, we
investigate whether LLMs can generate accurate and concise CPs for six clinical
phenotypes of varying complexity, which could be leveraged to enable scalable
clinical decision support to improve care for patients with hypertension. In
addition to evaluating zero-shot performance, we propose and test a
synthesize-execute-debug-instruct strategy that uses LLMs to generate and
iteratively refine CPs using data-driven feedback. Our results show that LLMs,
coupled with iterative learning, can generate interpretable and reasonably
accurate programs that approach the performance of state-of-the-art ML methods
while requiring significantly fewer training examples.
comment: To appear in PMLR, Volume 298, Machine Learning for Healthcare, 2025
☆ Fairy$\pm i$: the First 2-bit Complex LLM with All Parameters in $\{\pm1, \pm i\}$
Feiyu Wang, Guoan Wang, Yihao Zhang, Shengfan Wang, Weitao Li, Bokai Huang, Shimao Chen, Zihan Jiang, Rui Xu, Tong Yang
Quantization-Aware Training (QAT) integrates quantization into the training
loop, enabling LLMs to learn robust low-bit representations, and is widely
recognized as one of the most promising research directions. All current QAT
research focuses on minimizing quantization error on full-precision models,
where the full-precision accuracy acts as an upper bound (accuracy ceiling). No
existing method has even attempted to surpass this ceiling. To break this
ceiling, we propose a new paradigm: raising the ceiling (full-precision model),
and then still quantizing it efficiently into 2 bits. We propose Fairy$\pm i$,
the first 2-bit quantization framework for complex-valued LLMs. Specifically,
our method leverages the representational advantages of the complex domain to
boost full-precision accuracy. We map weights to the fourth roots of unity
$\{\pm1, \pm i\}$, forming a perfectly symmetric and information-theoretically
optimal 2-bit representation. Importantly, each quantized weight has either a
zero real or imaginary part, enabling multiplication-free inference using only
additions and element swaps. Experimental results show that Fairy$\pm i$
outperforms the ceiling of existing 2-bit quantization approaches in terms of
both PPL and downstream tasks, while maintaining strict storage and compute
efficiency. This work opens a new direction for building highly accurate and
practical LLMs under extremely low-bit constraints.
comment: 13 pages, 14 figures
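The multiplication-free property follows directly from complex arithmetic on the fourth roots of unity: multiplying an activation by ±1 or ±i only negates and/or swaps its real and imaginary parts. A toy sketch of that identity (an illustrative encoding, not the paper's inference kernels):

```python
def mul_by_unit_root(a, b, w):
    """Multiply the complex activation a + b*i by a weight w in
    {+1, -1, +i, -i} using only negation and a real/imag swap."""
    if w == "+1":
        return a, b
    if w == "-1":
        return -a, -b
    if w == "+i":   # (a + bi) * i  = -b + ai
        return -b, a
    if w == "-i":   # (a + bi) * -i =  b - ai
        return b, -a
    raise ValueError("weight must be one of +1, -1, +i, -i")
```

Since every weight takes one of four values, each also fits exactly in 2 bits, which is what makes the representation information-theoretically tight.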
☆ SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription
Raymond Grossman, Taejin Park, Kunal Dhawan, Andrew Titus, Sophia Zhi, Yulia Shchadilova, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg
We introduce SPGISpeech 2.0, a dataset suitable for speaker-tagged
transcription in the financial domain. SPGISpeech 2.0 improves the diversity of
applicable modeling tasks while maintaining the core characteristic of the
original SPGISpeech dataset: audio snippets and their corresponding fully
formatted text transcriptions, usable for end-to-end automatic speech
recognition (ASR). SPGISpeech 2.0 consists of 3,780 additional hours of
professionally transcribed earnings calls. Furthermore, the dataset contains
call and speaker information for each audio snippet, facilitating multi-talker
ASR. We validate the utility of SPGISpeech 2.0 through improvements in
speaker-tagged ASR performance of popular speech recognition models after
fine-tuning on SPGISpeech 2.0. Released free for non-commercial use, we expect
SPGISpeech 2.0 to foster advancements in speech recognition technologies and
inspire a wide range of research applications.
comment: To be presented at Interspeech 2025
☆ Do Political Opinions Transfer Between Western Languages? An Analysis of Unaligned and Aligned Multilingual LLMs
Public opinion surveys show cross-cultural differences in political opinions
between socio-cultural contexts. However, there is no clear evidence whether
these differences translate to cross-lingual differences in multilingual large
language models (MLLMs). We analyze whether opinions transfer between languages
or whether there are separate opinions for each language in MLLMs of various
sizes across five Western languages. We evaluate MLLMs' opinions by prompting
them to report their (dis)agreement with political statements from voting
advice applications. To better understand the interaction between languages in
the models, we evaluate them both before and after aligning them with more left
or right views using direct preference optimization and English alignment data
only. Our findings reveal that unaligned models show only very few significant
cross-lingual differences in the political opinions they reflect. The political
alignment shifts opinions almost uniformly across all five languages. We
conclude that in Western language contexts, political opinions transfer between
languages, demonstrating the challenges in achieving explicit socio-linguistic,
cultural, and political alignment of MLLMs.
★ Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees
Large Language Models (LLMs) have shown remarkable progress in
multiple-choice question answering (MCQA), but their inherent unreliability,
such as hallucination and overconfidence, limits their application in high-risk
domains. To address this, we propose a frequency-based uncertainty
quantification method under black-box settings, leveraging conformal prediction
(CP) to ensure provable coverage guarantees. Our approach involves multiple
independent samplings of the model's output distribution for each input, with
the most frequent sample serving as a reference to calculate predictive entropy
(PE). Experimental evaluations across six LLMs and four datasets (MedMCQA,
MedQA, MMLU, MMLU-Pro) demonstrate that frequency-based PE outperforms
logit-based PE in distinguishing between correct and incorrect predictions, as
measured by AUROC. Furthermore, the method effectively controls the empirical
miscoverage rate under user-specified risk levels, validating that sampling
frequency can serve as a viable substitute for logit-based probabilities in
black-box scenarios. This work provides a distribution-free, model-agnostic
framework for reliable uncertainty quantification in MCQA with guaranteed
coverage, enhancing the trustworthiness of LLMs in practical applications.
comment: under review
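The frequency-based estimate can be sketched as follows: sample the model's answer repeatedly, treat empirical frequencies as probabilities for predictive entropy, and calibrate a split-conformal threshold on held-out scores. The helper names and the choice of nonconformity score are illustrative assumptions, not the paper's exact procedure.

```python
import math
from collections import Counter

def frequency_pe(samples):
    """Predictive entropy from repeated black-box samples of the
    model's answer, using empirical frequencies as probabilities."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest
    calibration nonconformity score (clipped to the sample)."""
    n = len(cal_scores)
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    return sorted(cal_scores)[k]

pe_confident = frequency_pe(["B"] * 9 + ["C"])         # near-unanimous
pe_uncertain = frequency_pe(["A", "B", "C", "D"] * 5)  # uniform over options
tau = conformal_threshold([0.1, 0.2, 0.3, 0.4, 0.5,
                           0.6, 0.7, 0.8, 0.9], alpha=0.1)
```

At test time, answer options whose score falls below tau would be admitted into the prediction set, giving the 1 - alpha coverage guarantee.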
☆ Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation
Albert Yu, Chengshu Li, Luca Macesanu, Arnav Balaji, Ruchira Ray, Raymond Mooney, Roberto Martín-Martín
Effective robotic systems for long-horizon human-robot collaboration must
adapt to a wide range of human partners, whose physical behavior, willingness
to assist, and understanding of the robot's capabilities may change over time.
This demands a tightly coupled communication loop that grants both agents the
flexibility to propose, accept, or decline requests as they coordinate toward
completing the task effectively. We apply a Mixed-Initiative dialog paradigm to
Collaborative human-roBot teaming and propose MICoBot, a system that handles
the common scenario where both agents, using natural language, take initiative
in formulating, accepting, or rejecting proposals on who can best complete
different steps of a task. To handle diverse, task-directed dialog, and find
successful collaborative strategies that minimize human effort, MICoBot makes
decisions at three levels: (1) a meta-planner considers human dialog to
formulate and code a high-level collaboration strategy, (2) a planner optimally
allocates the remaining steps to either agent based on the robot's capabilities
(measured by a simulation-pretrained affordance model) and the human's
estimated availability to help, and (3) an action executor decides the
low-level actions to perform or words to say to the human. Our extensive
evaluations in simulation and real-world -- on a physical robot with 18 unique
human participants over 27 hours -- demonstrate the ability of our method to
effectively collaborate with diverse human users, yielding significantly
better task success and user experience than a pure LLM baseline and other
agent-allocation models. See additional videos and materials at
https://robin-lab.cs.utexas.edu/MicoBot/.
comment: Project website at https://robin-lab.cs.utexas.edu/MicoBot/
☆ CoCoLex: Confidence-guided Copy-based Decoding for Grounded Legal Text Generation ACL 2025
Santosh T. Y. S. S, Youssef Tarek Elkhayat, Oana Ichim, Pranav Shetty, Dongsheng Wang, Zhiqiang Ma, Armineh Nourbakhsh, Xiaomo Liu
Due to their ability to process long and complex contexts, LLMs can offer key
benefits to the Legal domain, but their adoption has been hindered by their
tendency to generate unfaithful, ungrounded, or hallucinatory outputs. While
Retrieval-Augmented Generation offers a promising solution by grounding
generations in external knowledge, it offers no guarantee that the provided
context will be effectively integrated. To address this, context-aware decoding
strategies have been proposed to amplify the influence of relevant context, but
they usually do not explicitly enforce faithfulness to the context. In this
work, we introduce Confidence-guided Copy-based Decoding for Legal Text
Generation (CoCoLex), a decoding strategy that dynamically interpolates the
model-produced vocabulary distribution with a distribution derived from
copying the context. CoCoLex encourages direct copying based on the
model's confidence, ensuring greater fidelity to the source. Experimental
results on five legal benchmarks demonstrate that CoCoLex outperforms existing
context-aware decoding methods, particularly in long-form generation tasks.
comment: Accepted to ACL 2025-Main Conference
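A generic confidence-guided copy mixture can be sketched as below. The specific confidence signal and interpolation schedule used by CoCoLex may differ, so treat the `1 - max(p_model)` weight as a hypothetical choice.

```python
import numpy as np

def copy_guided_step(p_model, context_ids, vocab_size, copy_weight=None):
    """One decoding step: mix the model's vocabulary distribution with
    a copy distribution that puts mass on tokens present in the context.
    The copy weight defaults to 1 - max(p_model), i.e. copy more when
    the model is less confident (a hypothetical heuristic)."""
    p_copy = np.zeros(vocab_size)
    for t in context_ids:              # mass proportional to context counts
        p_copy[t] += 1.0
    p_copy /= p_copy.sum()
    if copy_weight is None:
        copy_weight = 1.0 - p_model.max()
    p_mix = (1 - copy_weight) * p_model + copy_weight * p_copy
    return p_mix / p_mix.sum()

# Toy 5-token vocabulary; tokens 3 and 4 appear in the retrieved context.
p_model = np.array([0.1, 0.6, 0.1, 0.1, 0.1])
p_mix = copy_guided_step(p_model, context_ids=[3, 3, 4], vocab_size=5)
```

Tokens present in the retrieved legal context gain probability mass, nudging generation toward verbatim grounding without overriding a confident model.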
☆ The World According to LLMs: How Geographic Origin Influences LLMs' Entity Deduction Capabilities
Large Language Models (LLMs) have been extensively tuned to mitigate explicit
biases, yet they often exhibit subtle implicit biases rooted in their
pre-training data. Rather than directly probing LLMs with human-crafted
questions that may trigger guardrails, we propose studying how models behave
when they proactively ask questions themselves. The 20 Questions game, a
multi-turn deduction task, serves as an ideal testbed for this purpose. We
systematically evaluate geographic performance disparities in entity deduction
using a new dataset, Geo20Q+, consisting of both notable people and culturally
significant objects (e.g., foods, landmarks, animals) from diverse regions. We
test popular LLMs across two gameplay configurations (canonical 20-question and
unlimited turns) and in seven languages (English, Hindi, Mandarin, Japanese,
French, Spanish, and Turkish). Our results reveal geographic disparities: LLMs
are substantially more successful at deducing entities from the Global North
than the Global South, and the Global West than the Global East. While
Wikipedia pageviews and pre-training corpus frequency correlate mildly with
performance, they fail to fully explain these disparities. Notably, the
language in which the game is played has minimal impact on performance gaps.
These findings demonstrate the value of creative, free-form evaluation
frameworks for uncovering subtle biases in LLMs that remain hidden in standard
prompting setups. By analyzing how models initiate and pursue reasoning goals
over multiple turns, we find geographic and cultural disparities embedded in
their reasoning processes. We release the dataset (Geo20Q+) and code at
https://sites.google.com/view/llmbias20q/home.
comment: Conference on Language Modeling 2025
☆ LAG: Logic-Augmented Generation from a Cartesian Perspective
Large language models (LLMs) have demonstrated remarkable capabilities across
a wide range of tasks, yet exhibit critical limitations in knowledge-intensive
tasks, often generating hallucinations when faced with questions requiring
specialized expertise. While retrieval-augmented generation (RAG) mitigates
this by integrating external knowledge, it struggles with complex reasoning
scenarios due to its reliance on direct semantic retrieval and lack of
structured logical organization. Inspired by Cartesian principles from
\textit{Discours de la m\'ethode}, this paper introduces Logic-Augmented
Generation (LAG), a novel paradigm that reframes knowledge augmentation through
systematic question decomposition and dependency-aware reasoning. Specifically,
LAG first decomposes complex questions into atomic sub-questions ordered by
logical dependencies. It then resolves these sequentially, using prior answers
to guide context retrieval for subsequent sub-questions, ensuring stepwise
grounding in the logical chain. To prevent error propagation, LAG incorporates a
logical termination mechanism that halts inference upon encountering
unanswerable sub-questions and reduces wasted computation on excessive
reasoning. Finally, it synthesizes all sub-resolutions to generate verified
responses. Experiments on four benchmark datasets demonstrate that LAG
significantly enhances reasoning robustness, reduces hallucination, and aligns
LLM problem-solving with human cognition, offering a principled alternative to
existing RAG systems.
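The decompose-resolve-terminate loop can be sketched with stubbed interfaces; `decompose` and `resolve` below are hypothetical stand-ins for the LLM and retrieval calls.

```python
def logic_augmented_answer(question, decompose, resolve):
    """LAG-style control flow: `decompose` returns dependency-ordered
    sub-questions; `resolve` answers one sub-question given the prior
    answers, or returns None when it is unanswerable (logical
    termination, halting inference instead of propagating errors)."""
    answers = []
    for sub in decompose(question):
        ans = resolve(sub, answers)
        if ans is None:                  # unanswerable: stop early
            return None, answers
        answers.append((sub, ans))
    return (answers[-1][1] if answers else None), answers

# Toy stubs: a fixed decomposition and a lookup table of "retrieved" facts.
subs = ["Who wrote Hamlet?", "When was that author born?"]
facts = {"Who wrote Hamlet?": "Shakespeare",
         "When was that author born?": "1564"}
final, trace = logic_augmented_answer(
    "When was the author of Hamlet born?",
    decompose=lambda q: subs,
    resolve=lambda sub, prior: facts.get(sub))
```

In the real system, `resolve` would retrieve context conditioned on the prior sub-answers before querying the LLM, which is what gives the stepwise grounding.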
☆ MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs
Multimodal Large Language Models (MLLMs) have shown remarkable performance in
high-resource languages. However, their effectiveness diminishes significantly
in the contexts of low-resource languages. Current multilingual enhancement
methods are often limited to text modality or rely solely on machine
translation. While such approaches help models acquire basic linguistic
capabilities and produce "thin descriptions", they neglect the importance of
multimodal informativeness and cultural groundedness, both of which are crucial
for serving low-resource language users effectively. To bridge this gap, in
this study, we identify two significant objectives for a truly effective MLLM
in low-resource language settings, namely 1) linguistic capability and 2)
cultural groundedness, placing special emphasis on cultural awareness. To
achieve these dual objectives, we propose a dual-source strategy that guides
the collection of data tailored to each goal, sourcing native web alt-text for
culture and MLLM-generated captions for linguistics. As a concrete
implementation, we introduce MELLA, a multimodal, multilingual dataset.
Experimental results show that after fine-tuning on MELLA, there is a general
performance improvement for the eight languages on various MLLM backbones, with
models producing "thick descriptions". We verify that the performance gains are
from both cultural knowledge enhancement and linguistic capability enhancement.
Our dataset can be found at https://opendatalab.com/applyMultilingualCorpus.
☆ Can Large Language Models Generate Effective Datasets for Emotion Recognition in Conversations?
Emotion recognition in conversations (ERC) focuses on identifying emotion
shifts within interactions, representing a significant step toward advancing
machine intelligence. However, ERC data remains scarce, and existing datasets
face numerous challenges due to their highly biased sources and the inherent
subjectivity of soft labels. Even though Large Language Models (LLMs) have
demonstrated their quality in many affective tasks, they are typically
expensive to train, and their application to ERC tasks--particularly in data
generation--remains limited. To address these challenges, we employ a small,
resource-efficient, and general-purpose LLM to synthesize ERC datasets with
diverse properties, supplementing the three most widely used ERC benchmarks. We
generate six novel datasets, with two tailored to enhance each benchmark. We
evaluate the utility of these datasets to (1) supplement existing datasets for
ERC classification, and (2) analyze the effects of label imbalance in ERC. Our
experimental results indicate that ERC classifier models trained on the
generated datasets exhibit strong robustness and consistently achieve
statistically significant performance improvements on existing ERC benchmarks.
comment: 8 pages, 4 figures
☆ Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations
We systematically examine, analyze, and compare representative creativity
measures--creativity index, perplexity, syntactic templates, and
LLM-as-a-Judge--across diverse creative domains, including creative writing,
unconventional problem-solving, and research ideation. Our analyses reveal that
these metrics exhibit limited consistency, capturing different dimensions of
creativity. We highlight key limitations, including the creativity index's
focus on lexical diversity, perplexity's sensitivity to model confidence, and
syntactic templates' inability to capture conceptual creativity. Additionally,
LLM-as-a-Judge shows instability and bias. Our findings underscore the need for
more robust, generalizable evaluation frameworks that better align with human
judgments of creativity.
comment: 15 pages, 6 figures
☆ TASE: Token Awareness and Structured Evaluation for Multilingual Language Models
While large language models (LLMs) have demonstrated remarkable performance
on high-level semantic tasks, they often struggle with fine-grained,
token-level understanding and structural reasoning--capabilities that are
essential for applications requiring precision and control. We introduce TASE,
a comprehensive benchmark designed to evaluate LLMs' ability to perceive and
reason about token-level information across languages. TASE covers 10 tasks
under two core categories: token awareness and structural understanding,
spanning Chinese, English, and Korean, with a 35,927-instance evaluation set
and a scalable synthetic data generation pipeline for training. Tasks include
character counting, token alignment, syntactic structure parsing, and length
constraint satisfaction. We evaluate over 30 leading commercial and open-source
LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1, and train a
custom Qwen2.5-14B model using the GRPO training method. Results show that
human performance significantly outpaces current LLMs, revealing persistent
weaknesses in token-level reasoning. TASE sheds light on these limitations and
provides a new diagnostic lens for future improvements in low-level language
understanding and cross-lingual generalization. Our code and dataset are
publicly available at https://github.com/cyzcz/Tase .
☆ Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
Matteo Prandi, Vincenzo Suriani, Federico Pierucci, Marcello Galisai, Daniele Nardi, Piercosma Bisconti
The rapid advancement of General Purpose AI (GPAI) models necessitates robust
evaluation frameworks, especially with emerging regulations like the EU AI Act
and its associated Code of Practice (CoP). Current AI evaluation practices
depend heavily on established benchmarks, but these tools were not designed to
measure the systemic risks that are the focus of the new regulatory landscape.
This research addresses the urgent need to quantify this "benchmark-regulation
gap." We introduce Bench-2-CoP, a novel, systematic framework that uses
validated LLM-as-judge analysis to map the coverage of 194,955 questions from
widely-used benchmarks against the EU AI Act's taxonomy of model capabilities
and propensities. Our findings reveal a profound misalignment: the evaluation
ecosystem is overwhelmingly focused on a narrow set of behavioral propensities,
such as "Tendency to hallucinate" (53.7% of the corpus) and "Discriminatory
bias" (28.9%), while critical functional capabilities are dangerously
neglected. Crucially, capabilities central to loss-of-control scenarios,
including evading human oversight, self-replication, and autonomous AI
development, receive zero coverage in the entire benchmark corpus. This
translates to a near-total evaluation gap for systemic risks like "Loss of
Control" (0.4% coverage) and "Cyber Offence" (0.8% coverage). This study
provides the first comprehensive, quantitative analysis of this gap, offering
critical insights for policymakers to refine the CoP and for developers to
build the next generation of evaluation tools, ultimately fostering safer and
more compliant AI.
★ LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Ming Zhang, Yujiong Shen, Jingyi Deng, Yuhui Wang, Yue Zhang, Junzhe Wang, Shichun Liu, Shihan Dou, Huayu Sha, Qiyuan Peng, Changhao Jiang, Jingqi Tong, Yilong Wu, Zhihao Zhang, Mingqi Wu, Zhiheng Xi, Mingxu Chai, Tao Liang, Zhihui Fei, Zhen Wang, Mingyang Wan, Guojun Ma, Tao Gui, Qi Zhang, Xuanjing Huang
Existing evaluation of Large Language Models (LLMs) on static benchmarks is
vulnerable to data contamination and leaderboard overfitting, critical issues
that obscure true model capabilities. To address this, we introduce LLMEval-3,
a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary
bank of 220k graduate-level questions, from which it dynamically samples unseen
test sets for each evaluation run. Its automated pipeline ensures integrity via
contamination-resistant data curation, a novel anti-cheating architecture, and
a calibrated LLM-as-a-judge process achieving 90% agreement with human experts,
complemented by a relative ranking system for fair comparison. A 20-month
longitudinal study of nearly 50 leading models reveals a performance ceiling on
knowledge memorization and exposes data contamination vulnerabilities
undetectable by static benchmarks. The framework demonstrates exceptional
robustness in ranking stability and consistency, providing strong empirical
validation for the dynamic evaluation paradigm. LLMEval-3 offers a robust and
credible methodology for assessing the true capabilities of LLMs beyond
leaderboard scores, promoting the development of more trustworthy evaluation
standards.
☆ MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints
Large Language Models (LLMs) often exhibit cultural biases due to training
data dominated by high-resource languages like English and Chinese. This poses
challenges for accurately representing and evaluating diverse cultural
contexts, particularly in low-resource language settings. To address this, we
introduce MyCulture, a benchmark designed to comprehensively evaluate LLMs on
Malaysian culture across six pillars: arts, attire, customs, entertainment,
food, and religion, all presented in Bahasa Melayu. Unlike conventional benchmarks,
MyCulture employs a novel open-ended multiple-choice question format without
predefined options, thereby reducing guessing and mitigating format bias. We
provide a theoretical justification for the effectiveness of this open-ended
structure in improving both fairness and discriminative power. Furthermore, we
analyze structural bias by comparing model performance on structured versus
free-form outputs, and assess language bias through multilingual prompt
variations. Our evaluation across a range of regional and international LLMs
reveals significant disparities in cultural comprehension, highlighting the
urgent need for culturally grounded and linguistically inclusive benchmarks in
the development and assessment of LLMs.
☆ The TUB Sign Language Corpus Collection
Eleftherios Avramidis, Vera Czehmann, Fabian Deckert, Lorenz Hufe, Aljoscha Lipski, Yuni Amaloa Quintero Villalobos, Tae Kwon Rhee, Mengqian Shi, Lennart Stölting, Fabrizio Nunnari, Sebastian Möller
We present a collection of parallel corpora of 12 sign languages in video
format, together with subtitles in the dominant spoken languages of the
corresponding countries. The entire collection includes more than 1,300 hours
in 4,381 video files, accompanied by 1.3M subtitles containing 14M tokens.
Most notably, it includes the first consistent parallel corpora for 8 Latin
American sign languages, while its German Sign Language corpora are ten times
the size of those previously available. The collection was
created by collecting and processing videos of multiple sign languages from
various online sources, mainly broadcast material of news shows, governmental
bodies and educational channels. The preparation involved several stages,
including data collection, informing the content creators and seeking usage
approvals, scraping, and cropping. The paper provides statistics on the
collection and an overview of the methods used to collect the data.
☆ Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025
Agentic Retrieval Augmented Generation (RAG) and 'deep research' systems aim
to enable autonomous search processes where Large Language Models (LLMs)
iteratively refine outputs. However, applying these systems to domain-specific
professional search, such as biomedical research, presents challenges, as
automated systems may reduce user involvement and misalign with expert
information needs. Professional search tasks often demand high levels of user
expertise and transparency. The BioASQ CLEF 2025 challenge, using
expert-formulated questions, can serve as a platform to study these issues. We
explored the performance of current reasoning and non-reasoning LLMs like
Gemini-Flash 2.0, o3-mini, o4-mini and DeepSeek-R1. A key aspect of our
methodology was a self-feedback mechanism where LLMs generated, evaluated, and
then refined their outputs for query expansion and for multiple answer types
(yes/no, factoid, list, ideal). We investigated whether this iterative
self-correction improves performance and if reasoning models are more capable
of generating useful feedback. Preliminary results indicate varied performance
for the self-feedback strategy across models and tasks. This work offers
insights into LLM self-correction and informs future work on comparing the
effectiveness of LLM-generated feedback with direct human expert input in these
search systems.
comment: Version as accepted at the BioASQ Lab at CLEF 2025
☆ Evaluation of a Sign Language Avatar on Comprehensibility, User Experience \& Acceptability
Fenya Wasserroth, Eleftherios Avramidis, Vera Czehmann, Tanja Kojic, Fabrizio Nunnari, Sebastian Möller
This paper presents an investigation into the impact of adding adjustment
features to an existing sign language (SL) avatar on a Microsoft Hololens 2
device. Through a detailed analysis of interactions of expert German Sign
Language (DGS) users with both adjustable and non-adjustable avatars in a
specific use case, this study identifies the key factors influencing the
comprehensibility, the user experience (UX), and the acceptability of such a
system. Despite user preference for adjustable settings, no significant
improvements in UX or comprehensibility were observed, which remained at low
levels, amid missing SL elements (mouthings and facial expressions) and
implementation issues (indistinct hand shapes, lack of feedback and menu
positioning). Hedonic quality was rated higher than pragmatic quality,
indicating that users found the system more emotionally or aesthetically
pleasing than functionally useful. Stress levels were higher for the adjustable
avatar, reflecting lower performance, greater effort and more frustration.
Additionally, concerns were raised about whether the Hololens adjustment
gestures are intuitive and easy to familiarise oneself with. While
acceptability of the concept of adjustability was generally positive, it was
strongly dependent on usability and animation quality. This study highlights
that personalisation alone is insufficient, and that SL avatars must be
comprehensible by default. Key recommendations include enhancing mouthing and
facial animation, improving interaction interfaces, and applying participatory
design.
☆ Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression
Recent Large Reasoning Language Models (LRLMs) employ long chain-of-thought
reasoning with complex reflection behaviors, typically signaled by specific
trigger words (e.g., "Wait" and "Alternatively") to enhance performance.
However, these reflection behaviors can lead to an overthinking problem, in
which redundant reasoning steps unnecessarily increase token usage, raise
inference costs, and reduce practical utility. In this paper, we
propose Certainty-Guided Reflection Suppression (CGRS), a novel method that
mitigates overthinking in LRLMs while maintaining reasoning accuracy. CGRS
operates by dynamically suppressing the model's generation of reflection
triggers when it exhibits high confidence in its current response, thereby
preventing redundant reflection cycles without compromising output quality. Our
approach is model-agnostic, requires no retraining or architectural
modifications, and can be integrated seamlessly with existing autoregressive
generation pipelines. Extensive experiments across four reasoning benchmarks
(i.e., AIME24, AMC23, MATH500, and GPQA-D) demonstrate CGRS's effectiveness: it
reduces token usage by an average of 18.5% to 41.9% while preserving accuracy.
It also achieves the optimal balance between length reduction and performance
compared to state-of-the-art baselines. These results hold consistently across
model architectures (e.g., DeepSeek-R1-Distill series, QwQ-32B, and Qwen3
family) and scales (4B to 32B parameters), highlighting CGRS's practical value
for efficient reasoning.
comment: Technical Report
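The suppression step the abstract describes can be illustrated with a minimal sketch. The trigger-token ids, the 0.9 confidence threshold, and greedy decoding are all illustrative assumptions, not details from the paper:

```python
import numpy as np

# Hypothetical vocab ids for reflection triggers such as "Wait" / "Alternatively".
TRIGGER_IDS = {1, 3}

def next_token(logits, answer_confidence, threshold=0.9):
    """Greedily pick the next token; when the model is already confident in
    its current answer, ban reflection-trigger tokens before decoding."""
    logits = np.asarray(logits, dtype=float).copy()
    if answer_confidence >= threshold:
        for t in TRIGGER_IDS:
            logits[t] = -np.inf  # suppressed: cannot be selected
    return int(np.argmax(logits))
```

In a real decoding loop the confidence estimate would come from the model itself (e.g., probability mass on the current answer), applied at every step.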
☆ A Novel Architecture for Symbolic Reasoning with Decision Trees and LLM Agents
We propose a hybrid architecture that integrates decision tree-based symbolic
reasoning with the generative capabilities of large language models (LLMs)
within a coordinated multi-agent framework. Unlike prior approaches that
loosely couple symbolic and neural modules, our design embeds decision trees
and random forests as callable oracles within a unified reasoning system.
Tree-based modules enable interpretable rule inference and causal logic, while
LLM agents handle abductive reasoning, generalization, and interactive
planning. A central orchestrator maintains belief state consistency and
mediates communication across agents and external tools, enabling reasoning
over both structured and unstructured inputs.
The system achieves strong performance on reasoning benchmarks. On
\textit{ProofWriter}, it improves entailment consistency by +7.2\% through
logic-grounded tree validation. On GSM8k, it achieves +5.3\% accuracy gains in
multistep mathematical problems via symbolic augmentation. On \textit{ARC}, it
boosts abstraction accuracy by +6.0\% through integration of symbolic oracles.
Applications in clinical decision support and scientific discovery show how the
system encodes domain rules symbolically while leveraging LLMs for contextual
inference and hypothesis generation. This architecture offers a robust,
interpretable, and extensible solution for general-purpose neuro-symbolic
reasoning.
☆ SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens
The recently proposed Large Concept Model (LCM) generates text by predicting
a sequence of sentence-level embeddings and training with either mean-squared
error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer
that "thinks" in the same continuous SONAR embedding space, yet is supervised
through token-level cross-entropy propagated via the frozen SONAR decoder. This
hybrid objective retains the semantic abstraction of LCM while eliminating its
diffusion sampler and restoring a likelihood-based training signal. Across
model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive
generation quality. We report scaling trends, ablations, benchmark results, and
release the complete training code and all pretrained checkpoints to foster
reproducibility and future research.
☆ Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue
Meta-reviewing is a pivotal stage in the peer-review process, serving as the
final step in determining whether a paper is recommended for acceptance. Prior
research on meta-reviewing has treated this as a summarization problem over
review reports. However, complementary to this perspective, meta-reviewing is a
decision-making process that requires weighing reviewer arguments and placing
them within a broader context. Prior research has demonstrated that
decision-makers can be effectively assisted in such scenarios via dialogue
agents. In line with this framing, we explore the practical challenges for
realizing dialog agents that can effectively assist meta-reviewers. Concretely,
we first address the issue of data scarcity for training dialogue agents by
generating synthetic data using Large Language Models (LLMs) based on a
self-refinement strategy to improve the relevance of these dialogues to expert
domains. Our experiments demonstrate that this method produces higher-quality
synthetic data and can serve as a valuable resource towards training
meta-reviewing assistants. Subsequently, we utilize this data to train dialogue
agents tailored for meta-reviewing and find that these agents outperform
\emph{off-the-shelf} LLM-based assistants for this task. Finally, we apply our
agents in real-world meta-reviewing scenarios and confirm their effectiveness
in enhancing the efficiency of meta-reviewing.\footnote{Code and Data:
https://github.com/UKPLab/arxiv2025-meta-review-as-dialog}
comment: 36 pages, 16 tables, 13 figures
☆ ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs
Chain-of-Thought (CoT) prompting has significantly advanced the reasoning
capabilities of Large Language Models (LLMs), yet the reliability of these
reasoning chains remains a critical challenge. A widely held "cascading
failure" hypothesis suggests that errors are most detrimental when they occur
early in the reasoning process. This paper challenges that assumption through
systematic error-injection experiments, revealing a counter-intuitive
phenomenon we term "Late-Stage Fragility": errors introduced in the later
stages of a CoT chain are significantly more likely to corrupt the final answer
than identical errors made at the beginning. To address this specific
vulnerability, we introduce the Adaptive Self-Correction Chain-of-Thought
(ASCoT) method. ASCoT employs a modular pipeline in which an Adaptive
Verification Manager (AVM) operates first, followed by the Multi-Perspective
Self-Correction Engine (MSCE). The AVM leverages a Positional Impact Score
function I(k) that assigns different weights based on the position within the
reasoning chains, addressing the Late-Stage Fragility issue by identifying and
prioritizing high-risk, late-stage steps. Once these critical steps are
identified, the MSCE applies robust, dual-path correction specifically to the
failure parts. Extensive experiments on benchmarks such as GSM8K and MATH
demonstrate that ASCoT achieves outstanding accuracy, outperforming strong
baselines, including standard CoT. Our work underscores the importance of
diagnosing specific failure modes in LLM reasoning and advocates for a shift
from uniform verification strategies to adaptive, vulnerability-aware
correction mechanisms.
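The abstract does not give the exact form of the Positional Impact Score I(k); a sketch under the assumption that it grows monotonically toward the end of the chain (the power-law shape and alpha are illustrative choices) could look like:

```python
def positional_impact(k, n, alpha=2.0):
    """Assumed form of I(k): step k of an n-step chain gets a weight that
    increases toward the end, reflecting Late-Stage Fragility."""
    return (k / n) ** alpha

def high_risk_steps(n_steps, budget, alpha=2.0):
    """Return the `budget` step indices with the highest impact scores,
    i.e. the steps a verification manager would check first."""
    ranked = sorted(range(1, n_steps + 1),
                    key=lambda k: positional_impact(k, n_steps, alpha),
                    reverse=True)
    return sorted(ranked[:budget])
```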
♻ ☆ TreeDiff: AST-Guided Code Generation with Diffusion LLMs
Yiming Zeng, Jinghan Cao, Zexin Li, Yiming Chen, Tao Ren, Dawei Xiang, Xidong Wu, Shangqian Gao, Tingting Yu
Recent advances in diffusion-based language models have opened new
possibilities for controllable and bidirectional sequence generation. These
models provide an alternative to traditional autoregressive approaches by
framing text generation as an iterative denoising process. However, applying
diffusion models to structured domains such as source code remains a
significant challenge. Programming languages differ from natural language in
that they follow strict syntactic and semantic rules, with hierarchical
organization that must be preserved for correctness. Standard token-level
corruption techniques used during training often ignore this structure, which
may hinder the model's ability to learn meaningful representations of code. To
address this limitation, we propose a syntax-aware diffusion framework that
incorporates structural priors from Abstract Syntax Trees (ASTs) into the
denoising process. Instead of masking individual tokens at random, we
selectively corrupt syntactically meaningful code spans derived from AST
subtrees. This enables the model to reconstruct programs in a way that respects
grammatical boundaries and captures long-range dependencies. Experimental
results demonstrate that syntax-aware corruption significantly improves
syntactic correctness, reconstruction accuracy, and generalization to unseen
code patterns. These findings highlight the potential of incorporating
structural information into diffusion-based training and suggest that
syntax-guided denoising is a promising direction for advancing diffusion-based
language models in code generation tasks.
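The span-selection idea can be sketched with Python's standard `ast` module: statement-level subtrees are one plausible corruption unit, and the `<MASK>` token and line-level granularity are assumptions for illustration, not the paper's exact scheme:

```python
import ast

def subtree_spans(source):
    """Collect (start_line, end_line) spans of statement-level AST subtrees;
    these are syntactically meaningful units a span-corruption scheme could
    mask instead of random tokens."""
    spans = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt):
            spans.append((node.lineno, node.end_lineno))
    return spans

def mask_span(lines, span, mask="<MASK>"):
    """Replace the lines covered by one subtree span with a mask token."""
    start, end = span
    return lines[:start - 1] + [mask] + lines[end:]
```

Masking the `if` statement as a whole, rather than arbitrary tokens inside it, is what keeps the corruption aligned with grammatical boundaries.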
♻ ☆ An Entity Linking Agent for Question Answering
Yajie Luo, Yihong Wu, Muzhi Li, Fengran Mo, Jia Ao Sun, Xinyu Wang, Liheng Ma, Yingxue Zhang, Jian-Yun Nie
Some Question Answering (QA) systems rely on knowledge bases (KBs) to provide
accurate answers. Entity Linking (EL) plays a critical role in linking natural
language mentions to KB entries. However, most existing EL methods are designed
for long contexts and do not perform well on short, ambiguous user questions in
QA tasks. We propose an entity linking agent for QA, based on a Large Language
Model that simulates human cognitive workflows. The agent actively identifies
entity mentions, retrieves candidate entities, and makes decisions. To verify
the effectiveness of our agent, we conduct two experiments: tool-based entity
linking and QA task evaluation. The results confirm the robustness and
effectiveness of our agent.
comment: 12 pages, 2 figures
♻ ☆ SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers
This study evaluates large language models (LLMs) in generating code from
algorithm descriptions in recent NLP papers. The task requires two key
competencies: (1) algorithm comprehension: synthesizing information from papers
and academic literature to understand implementation logic, and (2) coding
expertise: identifying dependencies and correctly implementing necessary APIs.
To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark
of 100 tasks from 36 NLP papers published in 2024, featuring detailed
annotations and comprehensive test cases. Building on SciReplicate-Bench, we
propose Sci-Reproducer, a dual-agent framework consisting of a Paper Agent that
interprets algorithmic concepts from literature and a Code Agent that retrieves
dependencies from repositories and implements solutions. To assess algorithm
understanding, we introduce reasoning graph accuracy, which quantifies
similarity between generated and reference reasoning graphs derived from code
comments and structure. For evaluating implementation quality, we employ
execution accuracy, CodeBLEU, and repository dependency/API recall metrics. In
our experiments, we evaluate various powerful non-reasoning and reasoning LLMs
as foundational models. The best-performing LLM using \ModelName~achieves only
39% execution accuracy, highlighting the benchmark's difficulty. Our analysis
identifies missing or inconsistent algorithm descriptions as key barriers to
successful reproduction. We make available our benchmark and code at
https://github.com/xyzCS/SciReplicate-Bench and project homepage at
https://xyzcs.github.io/scireplicate.github.io/.
♻ ☆ Improving Factuality for Dialogue Response Generation via Graph-Based Knowledge Augmentation
Large Language Models (LLMs) succeed in many natural language processing
tasks. However, their tendency to hallucinate - generate plausible but
inconsistent or factually incorrect text - can cause significant problems in
certain tasks, including response generation in dialogue. To mitigate this
issue, we propose two novel graph knowledge-augmented frameworks, Dialogue
Response Generation via Textualised Graphs (TG-DRG) and Graph-Aware Dialogue
Response Generation (GA-DRG), which combine reasoning-guided dialogue
reformulation, dialogue sense knowledge selection, and graph-enhanced response
generation to improve the factuality of dialogue responses. To evaluate the
factuality of generated responses, we propose a dialogue fact score that
addresses the limitations of existing fact-score methods in dialogue settings,
providing a more reliable assessment of factual consistency. We evaluate our
methods using different baselines on the OpendialKG and HybriDialogue datasets.
Our methods noticeably improve factuality compared to other graph
knowledge-augmentation baselines, including the state-of-the-art G-retriever,
achieving improvements of 3.47% on OpendialKG and 3.12% on HybriDialogue in
terms of dialogue fact score. The code will be released on GitHub.
♻ ☆ Teaching LLMs How to Learn with Contextual Fine-Tuning ICLR 2025
Prompting Large Language Models (LLMs), or providing context on the expected
model of operation, is an effective way to steer the outputs of such models to
satisfy human desiderata after they have been trained. But in rapidly evolving
domains, there is often a need to fine-tune LLMs to improve either the kind of
knowledge in their memory or their ability to perform open-ended reasoning in
new domains. When humans learn new concepts, we often do so by linking the new
material we are studying to concepts we have already learned. To that end, we
ask: can prompting help us teach LLMs how to learn? In this
work, we study a novel generalization of instruction tuning, called contextual
fine-tuning, to fine-tune LLMs. Our method leverages instructional prompts
designed to mimic human cognitive strategies in learning and problem-solving to
guide the learning process during training, aiming to improve the model's
interpretation and understanding of domain-specific knowledge. We empirically
demonstrate that this simple yet effective modification improves the ability of
LLMs to be fine-tuned rapidly on new datasets both within the medical and
financial domains.
comment: ICLR 2025
♻ ☆ BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom's-Taxonomy-Inspired Prompts
Despite the remarkable capabilities of large language models (LLMs) across a
range of tasks, mathematical reasoning remains a challenging frontier.
Motivated by the observation that humans learn more effectively when prompted
not what to think but how to think, we introduce BloomWise, a
cognitively-inspired prompting technique designed to enhance LLMs' performance
on mathematical problem solving while making their solutions more explainable.
BloomWise encourages LLMs to generate solutions -- in the form of explanations --
by progressing through a sequence of cognitive operations, from basic ones
(e.g., remembering) to more advanced reasoning skills (e.g., evaluating), mirroring
how humans build understanding. The process iterates through these levels,
halting early if a convergence criterion is met: specifically, if two or more
consecutive levels yield the same answer, the solution from the earliest such
level is output; otherwise, the process continues until all levels are
completed. Through extensive experiments across five popular math reasoning
datasets, we demonstrate the effectiveness of BloomWise. We also present
comprehensive ablation studies to analyze the strengths of each component
within our system.
comment: 16 pages, 2 figures
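The halting rule described above (stop as soon as two consecutive levels agree, output the earliest of the agreeing pair) can be written as a short loop; the level names and the `solve` callback are placeholders:

```python
def bloomwise(levels, solve):
    """Run `solve` at each cognitive level in order; halt as soon as two
    consecutive levels yield the same answer and return the answer from the
    earliest such level, otherwise return the final level's answer."""
    answers = []
    for i, level in enumerate(levels):
        answers.append(solve(level))
        if i > 0 and answers[i] == answers[i - 1]:
            return answers[i - 1]
    return answers[-1]
```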
♻ ☆ Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs
Jen-Tse Huang, Dasen Dai, Jen-Yuan Huang, Youliang Yuan, Xiaoyuan Liu, Wenxuan Wang, Wenxiang Jiao, Pinjia He, Zhaopeng Tu, Haodong Duan
Despite significant progress on popular multimodal benchmarks,
state-of-the-art Multimodal Large Language Models (MLLMs) continue to struggle
with basic visual reasoning tasks that are trivially solved by humans, such as
recognizing spatial relationships. To systematically investigate this gap, we
introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests from
a well-established cognitive psychology assessment. These subtests span four
core domains of human visual cognition: (1) Visualization and Spatial
Processing, (2) Perceptual and Closure, (3) Memory, and (4) Reasoning. We
evaluate 20 frontier MLLMs from GPT, Gemini, Claude, LLaMA, Qwen, and SEED
families. The best-performing model achieves a score of only 25.19 out of 100,
with consistent failures on tasks such as mental rotation, spatial relation
inference, and figure-ground discrimination, regardless of model size or
prompting strategy. These findings suggest that current MLLM performance gains
on high-level benchmarks do not reflect human-like low-level visual cognition,
challenging the assumption that large-scale pretraining naturally induces
gestalt-like perceptual capabilities. The dataset and evaluation toolkit are
publicly available at: https://github.com/CUHK-ARISE/VisFactor.
comment: Update: Evaluated 20 MLLMs; Added generated test cases
♻ ☆ Language Model Uncertainty Quantification with Attention Chain
Accurately quantifying a large language model's (LLM) predictive uncertainty
is crucial for judging the reliability of its answers. While most existing
research focuses on short, directly answerable questions with closed-form
outputs (e.g., multiple-choice), involving intermediate reasoning steps in LLM
responses is increasingly important. This added complexity complicates
uncertainty quantification (UQ) because the probabilities assigned to answer
tokens are conditioned on a vast space of preceding reasoning tokens. Direct
marginalization is infeasible, and the dependency inflates probability
estimates, causing overconfidence in UQ. To address this, we propose UQAC, an
efficient method that narrows the reasoning space to a tractable size for
marginalization. UQAC iteratively constructs an "attention chain" of tokens
deemed "semantically crucial" to the final answer via a backtracking procedure.
Starting from the answer tokens, it uses attention weights to identify the most
influential predecessors, then iterates this process until reaching the input
tokens. The resulting chain is further refined with similarity filtering and
probability thresholding, which reduce the reasoning space, facilitating the
approximation of the marginal answer token probabilities. We validate UQAC on
multiple reasoning benchmarks with advanced open-source LLMs, demonstrating
that it consistently delivers reliable UQ estimates with high computational
efficiency.
comment: 36 pages, 7 figures, 36 tables
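The backtracking procedure can be sketched on a toy causal attention matrix. Using a single matrix and a plain top-k rule is a simplification: real models would aggregate heads and layers, and UQAC additionally applies similarity filtering and probability thresholding afterwards:

```python
import numpy as np

def attention_chain(attn, answer_idx, top_k=1):
    """Backtrack from the answer token through its most-attended predecessors
    until the input is reached, collecting the 'attention chain'. `attn` is
    one causal attention matrix (row i attends only to positions < i)."""
    chain, frontier = set(), {answer_idx}
    while frontier:
        nxt = set()
        for i in frontier:
            preds = np.argsort(attn[i, :i])[::-1][:top_k]
            for p in map(int, preds):
                if p not in chain:
                    chain.add(p)
                    nxt.add(p)
        frontier = nxt
    return sorted(chain)
```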
♻ ☆ Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation
Large Language Models (LLMs) have a significant impact on healthcare
scenarios but remain prohibitively large for deployment in real-time,
resource-constrained environments such as edge devices. In this work, we
introduce a novel medical assistant system, optimized through our
general-purpose compression framework, which tailors Large Language Models
(LLMs) for deployment in specialized domains. By measuring neuron saliency on
domain-specific data, our method can aggressively prune irrelevant neurons,
reducing model size while preserving performance. Following pruning, we apply
post-training quantization to further reduce the memory footprint, and evaluate
the compressed model across medical benchmarks including MedMCQA, MedQA, and
PubMedQA. We also deploy the 50\% compressed Gemma and the 67\% compressed
LLaMA3 models on Jetson Orin Nano (18.7W peak) and Raspberry Pi 5 (6.3W peak),
achieving real-time, energy-efficient inference under hardware constraints.
comment: Accepted for publication in the Proceedings of IEEE BioCAS 2025
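The input-driven pruning step can be sketched as follows; scoring neurons by mean absolute activation on domain data is a common saliency proxy, assumed here rather than taken from the paper:

```python
import numpy as np

def prune_by_saliency(W, activations, keep_frac=0.5):
    """Score each hidden neuron by its mean absolute activation on
    domain-specific inputs, then keep only the most salient rows of the
    weight matrix W (one row per neuron)."""
    saliency = np.abs(activations).mean(axis=0)        # one score per neuron
    k = max(1, int(keep_frac * W.shape[0]))
    keep = np.sort(np.argsort(saliency)[::-1][:k])     # kept indices, in order
    return W[keep], keep
```

Post-training quantization would then be applied to the surviving weights to shrink the memory footprint further.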
♻ ☆ Understanding Large Language Model Behaviors through Interactive Counterfactual Generation and Analysis
Furui Cheng, Vilém Zouhar, Robin Shing Moon Chan, Daniel Fürst, Hendrik Strobelt, Mennatallah El-Assady
Understanding the behavior of large language models (LLMs) is crucial for
ensuring their safe and reliable use. However, existing explainable AI (XAI)
methods for LLMs primarily rely on word-level explanations, which are often
computationally inefficient and misaligned with human reasoning processes.
Moreover, these methods often treat explanation as a one-time output,
overlooking its inherently interactive and iterative nature. In this paper, we
present LLM Analyzer, an interactive visualization system that addresses these
limitations by enabling intuitive and efficient exploration of LLM behaviors
through counterfactual analysis. Our system features a novel algorithm that
generates fluent and semantically meaningful counterfactuals via targeted
removal and replacement operations at user-defined levels of granularity. These
counterfactuals are used to compute feature attribution scores, which are then
integrated with concrete examples in a table-based visualization, supporting
dynamic analysis of model behavior. A user study with LLM practitioners and
interviews with experts demonstrate the system's usability and effectiveness,
emphasizing the importance of involving humans in the explanation process as
active participants rather than passive recipients.
♻ ☆ Can open source large language models be used for tumor documentation in Germany? -- An evaluation on urological doctors' notes
Tumor documentation in Germany is largely done manually, requiring reading
patient records and entering data into structured databases. Large language
models (LLMs) could potentially enhance this process by improving efficiency
and reliability. This evaluation tests eleven different open source LLMs with
sizes ranging from 1-70 billion model parameters on three basic tasks of the
tumor documentation process: identifying tumor diagnoses, assigning ICD-10
codes, and extracting the date of first diagnosis. For evaluating the LLMs on
these tasks, a dataset of annotated text snippets based on anonymized doctors'
notes from urology was prepared. Different prompting strategies were used to
investigate the effect of the number of examples in few-shot prompting and to
explore the capabilities of the LLMs in general. The models Llama 3.1 8B,
Mistral 7B, and Mistral NeMo 12B performed comparably well on the tasks.
Models with less extensive training data or with fewer than 7 billion
parameters showed notably lower performance, while larger models did not
display performance gains. Few-shot examples drawn from a medical domain other
than urology also improved the outcomes, demonstrating the ability of LLMs to
handle the tasks needed for tumor documentation. Open source LLMs show a
strong potential for automating tumor
documentation. Models from 7-12 billion parameters could offer an optimal
balance between performance and resource efficiency. With tailored fine-tuning
and well-designed prompting, these models might become important tools for
clinical documentation in the future. The code for the evaluation is available
from https://github.com/stefan-m-lenz/UroLlmEval. We also release the dataset
as a new valuable resource that addresses the shortage of authentic and easily
accessible benchmarks in German-language medical NLP.
comment: 53 pages, 5 figures
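The few-shot prompting studied above can be sketched as a simple prompt builder for one of the three tasks (ICD-10 code assignment). The note snippets and code here are invented placeholders, not examples from the UroLlmEval dataset.

```python
# Sketch: few-shot prompt construction for ICD-10 code assignment.

def build_prompt(examples, query):
    parts = ["Assign the ICD-10 code for the tumor diagnosis in each note."]
    for note, code in examples:  # the k-shot demonstrations
        parts.append(f"Note: {note}\nICD-10: {code}")
    parts.append(f"Note: {query}\nICD-10:")  # model completes this line
    return "\n\n".join(parts)

examples = [("Histology confirms urothelial carcinoma of the bladder.", "C67.9")]
prompt = build_prompt(examples, "Biopsy shows renal cell carcinoma, left kidney.")
print(prompt.endswith("ICD-10:"))  # True
```

Varying the length of `examples` corresponds to the study's comparison of the number of few-shot examples.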
♻ ☆ Hierarchical Budget Policy Optimization for Adaptive Reasoning
Shangke Lyu, Linjuan Wu, Yuchen Yan, Xingyu Wu, Hao Li, Yongliang Shen, Peisheng Jiang, Weiming Lu, Jun Xiao, Yueting Zhuang
Large reasoning models achieve remarkable performance through extensive
chain-of-thought generation, yet they suffer from a critical inefficiency:
applying uniformly extensive reasoning regardless of problem complexity. We
present Hierarchical Budget Policy Optimization (HBPO), a reinforcement
learning framework that enables models to learn problem-specific reasoning
depths without sacrificing capability. Unlike existing approaches that impose
rigid constraints or rely on discrete mode selection, HBPO partitions the
exploration space into budget-constrained hierarchies (512-2560 tokens), each
with differentiated reward structures that preserve both efficiency incentives
and reasoning capabilities. This design addresses a fundamental challenge in
efficient reasoning training: traditional length penalties systematically bias
models away from necessary long reasoning paths, causing exploration space
collapse. Through hierarchical sampling and budget-aware rewards, HBPO
maintains exploration diversity while teaching models to recognize when
extended deliberation is warranted. Extensive experiments demonstrate that HBPO
reduces average token usage by up to 60.6% while improving accuracy by 3.14%
across four reasoning benchmarks. Most notably, HBPO exhibits emergent adaptive
behavior where models automatically adjust reasoning depth based on problem
complexity. Our results suggest that reasoning efficiency and capability are
not inherently conflicting, and can be simultaneously optimized through
appropriately structured hierarchical training that preserves exploration
diversity.
comment: Code: https://github.com/zju-real/hbpo Project Page:
https://zju-real.github.io/hbpo/
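One plausible shape for the budget-aware reward described above (the abstract does not give the exact formula, so the linear bonus/penalty and the `alpha` weight are assumptions): each sampled response belongs to a budget tier from the 512-2560 token range, and length is rewarded or penalized only relative to that tier's budget, so long reasoning in a high tier is not punished by a global length penalty.

```python
# Sketch: a budget-aware reward with per-tier, not global, length shaping.

BUDGETS = [512, 1024, 1792, 2560]  # token budgets spanning the abstract's range

def budget_reward(correct, n_tokens, budget, alpha=0.2):
    base = 1.0 if correct else 0.0
    overflow = max(0.0, n_tokens - budget) / budget    # penalize only past the tier budget
    efficiency = max(0.0, 1.0 - n_tokens / budget)     # bonus for finishing under budget
    return base + alpha * efficiency - alpha * overflow

# Within a tier, a shorter correct answer scores higher...
print(budget_reward(True, 300, 512) > budget_reward(True, 500, 512))  # True
# ...but a 2000-token correct answer in the 2560 tier is not penalized.
print(budget_reward(True, 2000, 2560) > 1.0)  # True
```

Differentiating the reward by tier in this way is what keeps long-reasoning trajectories inside the exploration space.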
♻ ☆ PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages
Priyanshu Kumar, Devansh Jain, Akhila Yerukola, Liwei Jiang, Himanshu Beniwal, Thomas Hartvigsen, Maarten Sap
Truly multilingual safety moderation efforts for Large Language Models (LLMs)
have been hindered by a narrow focus on a small set of languages (e.g.,
English, Chinese) as well as a limited scope of safety definition, resulting in
significant gaps in moderation capabilities. To bridge these gaps, we release
POLYGUARD, a new state-of-the-art multilingual safety model for safeguarding
LLM generations, and the corresponding training and evaluation datasets.
POLYGUARD is trained on POLYGUARDMIX, the largest multilingual safety training
corpus to date containing 1.91M samples across 17 languages (e.g., Chinese,
Czech, English, Hindi). We also introduce POLYGUARDPROMPTS, a high-quality
multilingual benchmark with 29K samples for the evaluation of safety
guardrails. Created by combining naturally occurring multilingual human-LLM
interactions and human-verified machine translations of an English-only safety
dataset (WildGuardMix; Han et al., 2024), our datasets contain prompt-output
pairs with labels of prompt harmfulness, response harmfulness, and response
refusal. Through extensive evaluations across multiple safety and toxicity
benchmarks, we demonstrate that POLYGUARD outperforms existing state-of-the-art
open-weight and commercial safety classifiers by 5.5%. Our contributions
advance efforts toward safer multilingual LLMs for all global users.
comment: Accepted to COLM 2025 Main Conference
♻ ☆ Recent Advances in Speech Language Models: A Survey ACL 2025
Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King
Large Language Models (LLMs) have recently garnered significant attention,
primarily for their capabilities in text-based interactions. However, natural
human interaction often relies on speech, necessitating a shift towards
voice-based models. A straightforward approach to achieve this involves a
pipeline of "Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)",
where input speech is transcribed to text, processed by an LLM, and then
converted back to speech. Despite being straightforward, this method suffers
from inherent limitations, such as information loss during modality conversion,
significant latency due to the complex pipeline, and error accumulation across
the three stages. To address these issues, Speech Language Models (SpeechLMs)
-- end-to-end models that generate speech without converting from text -- have
emerged as a promising alternative. This survey paper provides the first
comprehensive overview of recent methodologies for constructing SpeechLMs,
detailing the key components of their architecture and the various training
recipes integral to their development. Additionally, we systematically survey
the various capabilities of SpeechLMs, categorize their evaluation metrics, and
discuss the challenges and future research directions in this rapidly evolving
field. The GitHub repository is available at
https://github.com/dreamtheater123/Awesome-SpeechLM-Survey
comment: A shortened version of this paper has been accepted at ACL 2025
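The cascaded baseline that the survey contrasts with SpeechLMs can be sketched in a few lines; `asr`, `llm`, and `tts` are stand-in callables, not a real API, and the toy lambdas exist only to show the data flow and the two modality conversions where information can be lost.

```python
# Sketch: the ASR + LLM + TTS cascade described in the survey.

def cascaded_speech_agent(audio, asr, llm, tts):
    text_in = asr(audio)     # modality conversion 1: speech -> text (lossy)
    text_out = llm(text_in)  # text-only reasoning stage
    return tts(text_out)     # modality conversion 2: text -> speech

# Toy stand-ins for the three stages.
asr = lambda audio: "what time is it"
llm = lambda text: "it is noon"
tts = lambda text: f"<audio:{text}>"
print(cascaded_speech_agent(b"...", asr, llm, tts))  # <audio:it is noon>
```

An end-to-end SpeechLM replaces all three stages with a single model operating on speech, avoiding the conversions and the per-stage latency.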
♻ ☆ A Latent-Variable Model for Intrinsic Probing
The success of pre-trained contextualized representations has prompted
researchers to analyze them for the presence of linguistic information. Indeed,
it is natural to assume that these pre-trained representations do encode some
level of linguistic knowledge as they have brought about large empirical
improvements on a wide variety of NLP tasks, which suggests they are learning
true linguistic generalization. In this work, we focus on intrinsic probing, an
analysis technique where the goal is not only to identify whether a
representation encodes a linguistic attribute but also to pinpoint where this
attribute is encoded. We propose a novel latent-variable formulation for
constructing intrinsic probes and derive a tractable variational approximation
to the log-likelihood. Our results show that our model is versatile and yields
tighter mutual information estimates than two intrinsic probes previously
proposed in the literature. Finally, we find empirical evidence that
pre-trained representations develop a cross-lingually entangled notion of
morphosyntax.
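The core intrinsic-probing question above (not the paper's latent-variable formulation, which adds a tractable variational bound) can be illustrated with a plain linear probe restricted to a candidate subset of dimensions: if the attribute is still predictable from those dimensions alone, that is where it is encoded. The gradient-descent probe and toy data below are illustrative assumptions.

```python
# Sketch: restrict a logistic-regression probe to selected dimensions
# to localize where an attribute is encoded.
import numpy as np

def probe_accuracy(X, y, dims, epochs=200, lr=0.5):
    Xs = X[:, dims]                      # keep only the candidate dimensions
    w = np.zeros(Xs.shape[1]); b = 0.0
    for _ in range(epochs):              # plain gradient descent on log-loss
        p = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))
        g = p - y
        w -= lr * Xs.T @ g / len(y)
        b -= lr * g.mean()
    return float(((Xs @ w + b > 0) == (y == 1)).mean())

# Toy representations: the attribute lives entirely in dimension 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] > 0).astype(float)
print(probe_accuracy(X, y, dims=[0]) > probe_accuracy(X, y, dims=[5]))  # True
```

The paper's contribution is a principled, mutual-information-based version of this comparison rather than raw probe accuracy.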
♻ ☆ From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
While large language models have made significant strides in code generation,
the pass rate of the generated code is bottlenecked on subtle errors, often
requiring human intervention to pass tests, especially for complex problems.
Existing LLM-based debugging systems treat generated programs as monolithic
units, failing to address bugs at multiple levels of granularity, from
low-level syntax errors to high-level algorithmic flaws. In this paper, we
introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code
debugger that isolates, identifies, and resolves bugs at various levels of
granularity.
MGDebugger decomposes problematic code into a hierarchical tree structure of
subfunctions, with each level representing a particular granularity of error.
During debugging, it analyzes each subfunction and iteratively resolves bugs in
a bottom-up manner. To effectively test each subfunction, we propose an
LLM-simulated Python executor, which traces code execution and tracks important
variable states to pinpoint errors accurately. Extensive experiments
demonstrate that MGDebugger outperforms existing debugging systems, achieving
an 18.9% improvement in accuracy over seed generations in HumanEval and a 97.6%
repair success rate in HumanEvalFix. Furthermore, MGDebugger effectively fixes
bugs across different categories and difficulty levels, demonstrating its
robustness and effectiveness.
comment: Code and data available at https://github.com/YerbaPage/MGDebugger
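The bottom-up repair order over the subfunction tree described above amounts to a post-order traversal: leaves are debugged before their parents, so each parent is repaired only after its dependencies are fixed. The node type below is an assumption, and the LLM-driven analysis and repair steps are abstracted away.

```python
# Sketch: bottom-up (post-order) debugging order over a subfunction tree.
from dataclasses import dataclass, field

@dataclass
class Subfunction:
    name: str
    children: list = field(default_factory=list)

def bottom_up_order(node, order=None):
    """Visit children first, then the node itself: leaves before parents."""
    if order is None:
        order = []
    for child in node.children:
        bottom_up_order(child, order)
    order.append(node.name)  # in the real system: test/repair this subfunction
    return order

tree = Subfunction("solve", [Subfunction("parse"),
                             Subfunction("search", [Subfunction("score")])])
print(bottom_up_order(tree))  # ['parse', 'score', 'search', 'solve']
```

In MGDebugger, each visit would run the LLM-simulated executor on that subfunction's tests before moving up the tree.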
♻ ☆ GuARD: Effective Anomaly Detection through a Text-Rich and Graph-Informed Language Model KDD 2025
Anomaly detection on text-rich graphs is widely prevalent in real life, such
as detecting incorrectly assigned academic papers to authors and detecting bots
in social networks. The remarkable capabilities of large language models (LLMs)
pave a new avenue for utilizing rich-text information for effective anomaly
detection. However, simply introducing rich texts into LLMs can obscure
essential detection cues and introduce high fine-tuning costs. Moreover, LLMs
often overlook the intrinsic structural bias of graphs which is vital for
distinguishing normal from abnormal node patterns. To this end, this paper
introduces GuARD, a text-rich and graph-informed language model that combines
key structural features from graph-based methods with fine-grained semantic
attributes extracted via small language models for effective anomaly detection
on text-rich graphs. GuARD is optimized with a progressive multi-modal,
multi-turn instruction tuning framework in a task-guided instruction tuning
regime, tailored to incorporate both rich-text and structural modalities.
Extensive experiments on four datasets reveal that GuARD outperforms
graph-based and LLM-based anomaly detection methods, while offering up to
5$\times$ speedup in training and 5$\times$ speedup in inference
over vanilla long-context LLMs on the large-scale WhoIsWho dataset.
comment: Accepted at KDD 2025
♻ ☆ IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards
Xu Guo, Tianyi Liang, Tong Jian, Xiaogui Yang, Ling-I Wu, Chenhui Li, Zhihui Lu, Qipeng Guo, Kai Chen
Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction
following capabilities of large language models (LLMs), but suffers from
training inefficiency due to inadequate difficulty assessment. Moreover, RLVR
is prone to over-optimization, where LLMs exploit verification shortcuts
without aligning to the actual intent of user instructions. We introduce
Instruction Following Decorator (IFDecorator), a framework that wraps RLVR
training into a robust and sample-efficient pipeline. It consists of three
components: (1) a cooperative-adversarial data flywheel that co-evolves
instructions and hybrid verifications, generating progressively more
challenging instruction-verification pairs; (2) IntentCheck, a bypass module
enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that
detects reward hacking via trap instructions, which trigger and capture
shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves
87.43% accuracy on IFEval, outperforming larger proprietary models such as
GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench
while preserving general capabilities. Our trip wires show significant
reductions in reward hacking rates. We will release models, code, and data for
future research.
comment: 7 pages, 4 figures
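The trip-wire idea above can be illustrated with a minimal sketch, with the caveat that the trap design and names here are invented, not the paper's: a trap instruction explicitly forbids some behavior that a verifier shortcut (e.g. naive keyword matching) would still reward, so exhibiting that behavior flags shortcut exploitation.

```python
# Sketch: a trap instruction that detects shortcut exploitation.
# The instruction told the model NOT to use the forbidden word; a model
# gaming a keyword-matching verifier would use it anyway.

def tripwire_check(response, forbidden_word="banana"):
    return forbidden_word in response.lower()  # True => reward hacking flagged

print(tripwire_check("Sure! banana banana banana"))   # True
print(tripwire_check("Here is a fruit-free answer."))  # False
```

Aggregating such flags over training rollouts gives the reward-hacking rate the abstract reports reductions in.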